Online estimation of the geometric median in Hilbert spaces : non asymptotic confidence balls
Estimation procedures based on recursive algorithms are interesting and
powerful techniques that are able to deal rapidly with (very) large samples of
high dimensional data. The collected data may be contaminated by noise so that
robust location indicators, such as the geometric median, may be preferred to
the mean. In this context, an estimator of the geometric median based on a fast
and efficient averaged nonlinear stochastic gradient algorithm has been
developed by Cardot, Cénac and Zitt (2013). This work aims at studying more
precisely the non-asymptotic behavior of this algorithm by giving
non-asymptotic confidence balls. This new result is based on the derivation of
improved rates of convergence as well as an exponential inequality for
the martingale terms of the recursive nonlinear Robbins-Monro algorithm.
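A minimal sketch of an averaged stochastic gradient estimator of the geometric median may help fix ideas. It assumes the usual Robbins-Monro form: each new observation moves the current iterate by a step of length gamma_n along the unit vector pointing to that observation, and the returned estimate is the running mean of the iterates. The function name, the step schedule `c_gamma * n ** (-alpha)` and the default constants are illustrative assumptions, not values taken from the paper.

```python
import math
import random


def averaged_geometric_median(samples, c_gamma=1.0, alpha=0.66):
    """Averaged stochastic gradient estimate of the geometric median.

    c_gamma and alpha are hypothetical tuning constants (assumption);
    alpha is taken in (1/2, 1) so that the steps decrease slowly.
    """
    it = iter(samples)
    z = list(next(it))      # Robbins-Monro iterate, started at the first point
    z_bar = list(z)         # running average of the iterates
    for n, x in enumerate(it, start=1):
        diff = [xi - zi for xi, zi in zip(x, z)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0.0:
            step = c_gamma * n ** (-alpha)
            # Move a distance `step` along the unit vector towards x.
            z = [zi + step * d / norm for zi, d in zip(z, diff)]
        # Recursive update of the mean of all iterates seen so far.
        z_bar = [zb + (zi - zb) / (n + 1) for zb, zi in zip(z_bar, z)]
    return z_bar


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(5000)]
    print(averaged_geometric_median(data))
```

Each update uses only the current observation, so the data can be processed in a single pass without being stored.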
A fast and recursive algorithm for clustering large datasets with k-medians
Clustering large samples of high-dimensional data with fast algorithms is an
important challenge in computational statistics. Borrowing ideas from MacQueen
(1967), who introduced a sequential version of the k-means algorithm, a new
class of recursive stochastic gradient algorithms designed for the k-medians
loss criterion is proposed. By their recursive nature, these algorithms are
very fast and are well adapted to deal with large samples of data that are
allowed to arrive sequentially. It is proved that the stochastic gradient
algorithm converges almost surely to the set of stationary points of the
underlying loss criterion. Particular attention is paid to the averaged
versions, which are known to perform better, and a data-driven
procedure that allows automatic selection of the value of the descent step is
proposed.
The performance of the averaged sequential estimator is compared in a
simulation study, both in terms of computation speed and accuracy of the
estimations, with more classical partitioning techniques such as k-means,
trimmed k-means and PAM (partitioning around medoids). Finally, this new
online clustering technique is illustrated on determining television audience
profiles with a sample of more than 5000 individual television audiences
measured every minute over a period of 24 hours.
Comment: Under revision for Computational Statistics and Data Analysis.
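The recursive idea can be sketched as follows, in the spirit of MacQueen's sequential k-means: each incoming point is assigned to its nearest center, and only that center moves, by a small step along the unit vector towards the point. The function name, the per-cluster step `c_gamma / count ** alpha` and the way the averaged centers are updated are assumptions made for illustration. The data-driven choice of the descent step mentioned in the abstract is not reproduced here.

```python
import math
import random


def online_k_medians(stream, init_centers, c_gamma=1.0, alpha=0.75):
    """One-pass k-medians sketch; returns (centers, averaged_centers).

    c_gamma and alpha are hypothetical tuning constants (assumption).
    """
    centers = [list(c) for c in init_centers]
    averaged = [list(c) for c in init_centers]
    counts = [0] * len(centers)   # number of points assigned to each cluster
    for x in stream:
        dists = [math.dist(x, c) for c in centers]
        r = min(range(len(centers)), key=dists.__getitem__)  # nearest center
        counts[r] += 1
        if dists[r] > 0.0:
            step = c_gamma / counts[r] ** alpha
            # Unit-length gradient direction of the k-medians loss for cluster r.
            centers[r] = [c + step * (xi - c) / dists[r]
                          for xi, c in zip(x, centers[r])]
        # Running mean of the successive positions of center r.
        averaged[r] = [a + (c - a) / (counts[r] + 1)
                       for a, c in zip(averaged[r], centers[r])]
    return centers, averaged


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(m + rng.gauss(0, 1), m + rng.gauss(0, 1))
           for _ in range(1000) for m in (0.0, 10.0)]
    print(online_k_medians(pts, [(1.0, 1.0), (9.0, 9.0)])[1])
```

As with any stochastic gradient scheme of this kind, the result depends on the initial centers, which is consistent with convergence to stationary points rather than to a global minimizer.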
Digital search trees and chaos game representation
In this paper, we consider a possible representation of a DNA sequence in a
quaternary tree, in which one can visualize repetitions of subwords. The
CGR-tree turns a sequence of letters into a digital search tree (DST), obtained
from the suffixes of the reversed sequence. Several results are known
concerning the height and the insertion depth for DSTs built from i.i.d.
successive sequences. Here, the successively inserted words are strongly
dependent. We give the asymptotic behaviour of the insertion depth and of the
length of branches for the CGR-tree obtained from the suffixes of a reversed
i.i.d. or Markovian sequence. To first order, this behaviour turns out to be the
same as in the case of independent words. As a by-product, asymptotic
results on the length of the longest runs in a Markovian sequence are obtained.
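The construction can be illustrated with a small sketch. It assumes that the words inserted are the reversed prefixes of the sequence (equivalently, the suffixes of the reversed sequence), that the root is an empty placeholder, and that each word is placed at the first free slot found by reading its letters from the root. The function names and the nested-dictionary representation are illustrative choices, not the paper's notation.

```python
def reversed_prefixes(sequence):
    """Reversed prefixes of `sequence`, i.e. suffixes of the reversed sequence."""
    return [sequence[:i][::-1] for i in range(1, len(sequence) + 1)]


def build_dst(words):
    """Insert words into a digital search tree stored as nested dicts.

    Returns (root, depths), where depths[i] is the insertion depth of words[i].
    The root is an empty placeholder at depth 0 (assumption).
    """
    root = {}
    depths = []
    for word in words:
        node = root
        for depth, letter in enumerate(word, start=1):
            if letter not in node:
                node[letter] = {}      # first free slot: the word lives here
                depths.append(depth)
                break
            node = node[letter]
        else:
            # Cannot happen for reversed prefixes inserted in order: before the
            # n-th insertion the tree holds n-1 nodes, so its height is below n.
            raise ValueError(f"word {word!r} exhausted before a free slot")
    return root, depths


if __name__ == "__main__":
    _, depths = build_dst(reversed_prefixes("ACCAGGTAAAAC"))
    print(depths)
```

A run of identical letters produces a long branch in this tree (for instance, "AAAA" yields insertion depths 1, 2, 3, 4), which gives some intuition for the link with longest runs mentioned in the abstract.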
Variable length Markov chains and dynamical sources
Infinite random sequences of letters can be viewed as stochastic chains or as
strings produced by a source, in the sense of information theory. The
relationship between Variable Length Markov Chains (VLMC) and probabilistic
dynamical sources is studied. We establish a probabilistic frame for context
trees and VLMC and we prove that any VLMC is a dynamical source for which we
explicitly build the mapping. On two examples, the "comb" and the "bamboo
blossom", we find a necessary and sufficient condition for the existence and
the uniqueness of a stationary probability measure for the VLMC. These two
examples are detailed in order to provide the associated Dirichlet series as
well as the generating functions of word occurrences.
Comment: 45 pages, 15 figures.
Dynamical Systems in the Analysis of Biological Sequences
The Chaos Game Representation (CGR) maps a sequence of letters taken from a finite alphabet onto the unit square in $\mathbb{R}^2$. While it is a popular tool, few mathematical results have been proved to date. In this report, we show that the CGR gives rise to a limit measure, assuming only that the input sequence is stationary and ergodic. Some more precise properties are given in the i.i.d. and Markov cases. A new family of statistical tests to characterize the randomness of the inputs is proposed and analyzed. Finally, some basic properties of the CGR are used to generalize the notion of genomic signature.
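For a four-letter DNA alphabet, the CGR is commonly computed as an iterated midpoint map: start at the center of the unit square and, for each letter, move halfway towards the corner assigned to that letter. The sketch below assumes this standard construction. The corner assignment and the function name are conventions chosen for illustration, and other assignments appear in the literature.

```python
# Corner assigned to each nucleotide (a convention, assumed here).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the successive CGR points of `sequence` in the unit square."""
    x, y = start
    points = []
    for letter in sequence:
        cx, cy = CORNERS[letter]
        # Move halfway from the current point to the letter's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

With this map, the point reached after n letters lies in a sub-square of side 2^-n determined by the last n letters read, which is why repeated subwords show up as concentrations of points.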
Persistent random walks, variable length Markov chains and piecewise deterministic Markov processes
A classical random walk $(S_t, t \in \mathbb{N})$ is defined by $S_t := \sum_{n=0}^{t} X_n$, where the $(X_n)$ are i.i.d. When the increments $(X_n)_{n \in \mathbb{N}}$ form a first-order Markov chain, a short memory is introduced in the dynamics of $(S_t)$. This so-called "persistent" random walk is no longer Markovian and, under suitable conditions, the rescaled process converges towards the integrated telegraph noise (ITN) as the time-scale and space-scale parameters tend to zero (see [11, 17, 18]). The ITN process is likewise non-Markovian. The aim is to consider persistent random walks $(S_t)$ whose increments are Markov chains of variable order, which can be infinite. This variable memory is made explicit by a one-to-one correspondence between $(X_n)$ and a suitable Variable Length Markov Chain (VLMC), since for a VLMC the dependency on the past can be unbounded. The key idea is to consider the non-Markovian letter process $(X_n)$ as the margin of a couple $(X_n, M_n)_{n \geq 0}$, where $(M_n)_{n \geq 0}$ stands for the memory of the process $(X_n)$. We prove that, under a suitable rescaling, $(S_n, X_n, M_n)$ converges in distribution towards a continuous-time process $(S^0(t), X(t), M(t))$. The process $(S^0(t))$ is a semi-Markov and Piecewise Deterministic Markov Process whose paths are piecewise linear.
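The first-order case described at the start of this abstract can be simulated in a few lines. The sketch assumes increments in {-1, +1} that change sign with a probability depending only on the current increment. The function and parameter names are hypothetical, and the variable-length memory studied in the paper is not modeled here.

```python
import random


def persistent_walk(n_steps, p_switch_up, p_switch_down, x0=1, rng=None):
    """Simulate S_t = X_0 + ... + X_t for t = 0, ..., n_steps - 1.

    The increments (X_n) form a two-state Markov chain on {-1, +1}:
    from +1 the sign flips with probability p_switch_up,
    from -1 it flips with probability p_switch_down (assumed parametrization).
    """
    rng = rng or random.Random()
    x, s = x0, 0
    path = []
    for _ in range(n_steps):
        s += x
        path.append(s)
        p = p_switch_up if x == 1 else p_switch_down
        if rng.random() < p:
            x = -x      # change direction
    return path


if __name__ == "__main__":
    print(persistent_walk(20, 0.1, 0.1, rng=random.Random(42)))
```

Small switching probabilities give long straight stretches, which is the "persistence" in the name. Setting both probabilities to 1/2 makes the increments independent and symmetric, recovering the classical walk.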
Digital search trees and chaos game representation
Preliminary version (2006) of a work published in final form (2009). In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords. The CGR-tree turns a sequence of letters into a digital search tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height and the insertion depth for DSTs built from i.i.d. successive sequences. Here, the successively inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. To first order, this behaviour turns out to be the same as in the case of independent words. As a by-product, asymptotic results on the length of the longest runs in a Markovian sequence are obtained.